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Chapter 1 

Kernel Spectral Clustering and applications 


Abstract In this chapter we review the main literature related to kernel spectral 
clustering (KSC), an approach to clustering cast within a kernel-based optimization 
setting. KSC represents a least-squares support vector machine based formulation 
of spectral clustering described by a weighted kernel PCA objective. Just as in the 
classifier case, the binary clustering model is expressed by a hyperplane in a high 
dimensional space induced by a kernel. In addition, the multi-way clustering can 
be obtained by combining a set of binary decision functions via an Error Correct¬ 
ing Output Codes (ECOC) encoding scheme. Because of its model-based nature, 
the KSC method encompasses three main steps: training, validation, testing. In the 
validation stage model selection is performed to obtain tuning parameters, like the 
number of clusters present in the data. This is a major advantage compared to clas¬ 
sical spectral clustering where the determination of the clustering parameters is un¬ 
clear and relies on heuristics. Once a KSC model is trained on a small subset of 
the entire data, it is able to generalize well to unseen test points. Beyond the basic 
formulation, sparse KSC algorithms based on the Incomplete Cholesky Decompo¬ 
sition (ICD) and Lq, Li,Lq+ L\, Group Lasso regularization are reviewed. In that 
respect, we show how it is possible to handle large scale data. Also, two possible 
ways to perform hierarchical clustering and a soft clustering method are presented. 
Einally, real-world applications such as image segmentation, power load time-series 
clustering, document clustering and big data learning are considered. 


1.1 Introduction 

Spectral clustering (SC) represents the most popular class of algorithms based on 
graph theory (Chung 1997). It makes use of the Laplacian’s spectrum to partition 
a graph into weakly connected sub-graphs. Moreover, if the graph is constructed 
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based on any kind of data (vector, images etc.), data clustering can be performecj^ 
SC began to be popularized when Shi and Malik introduced the Normalized Cut 
criterion to handle image segmentation (Shi & Malik 2000). Afterwards, Ng and 
Jordan (Ng et al. 2002) in a theoretical work based on matrix perturbation theory 
have shown conditions under which a good performance of the algorithm is ex¬ 
pected. Finally, in the tutorial by Von Luxburg the main literature related to SC 
has been exhaustively summarized (von Luxburg 2007). Although very success¬ 
ful in a number of applications, SC has some limitations. For instance, it can¬ 
not handle big data without using approximation methods like the Nystrom algo¬ 
rithm (Fowlkes et al. 2004, Williams & Seeger 2001), the power iteration method 
(Lin & Cohen 2010), or linear algebra based methods (Ning et al. 2010, Dhanjal 
et al. 2013, Frederix & Van Barel 2013). Furthermore, the generalization to out-of- 
sample data is only approximate. 

These issues have been recently tackled by means of a spectral clustering algo¬ 
rithm formulated as weighted kernel PCA (Alzate & Suykens 2010). The technique, 
named kernel spectral clustering (KSC), is based on solving a constrained opti¬ 
mization problem in a primal-dual setting. In other words, KSC is a Least Squares 
Support Vector Machine (LS-S VM (Suykens et al. 2002)) model used for clustering 
instead of classificatior0 By casting SC in a learning framework, KSC allows to 
rigorously select tuning parameters such as the natural number of clusters which are 
present in the data. Also, an accurate prediction of the cluster memberships for un¬ 
seen points can be easily done by projecting test data in the embedding eigenspace 
learned during training. Furthermore, the algorithm can be tailored to a given appli¬ 
cation by using the most appropriate kernel function. Beyond that, by using sparse 
formulations and a fixed-size (Suykens et al. 2002, De Brabanter et al. 2010) ap¬ 
proach, it is possible to readily handle big data. Finally, by means of adequate adap¬ 
tations of the core algorithm, hierarchical clustering and a soft clustering approach 
have been proposed. 

All these topics will be detailed in the next Sections. Precisely, after present¬ 
ing the basic KSC method, the soft KSC algorithm will be summarized. Next, 
two possible ways to accomplish hierarchical clustering will be explained. After¬ 
wards, some sparse formulations based on the Incomplete Cholesky Decompo¬ 
sition (ICD) and Lq, L\,L(, + L\, Group Lasso regularization will be described. 
Lastly, various interesting applications in different domains such as computer vi¬ 
sion, power-load consumer profiling, information retrieval and big data clustering 
will be illustrated. All these examples assume a static setting. Concerning other 
applications in a dynamic scenario the interested reader can refer to (Langone, 
Alzate, De Ketelaere & Suykens 2013, Langone et al. 2015) for fault detection, 
to (Langone, Agudelo, De Moor & Suykens 2014) for incremental time-series clus¬ 
tering, to (Langone, Alzate & Suykens 2013, Langone & Suykens 2013, Langone, 


* In this case the given data points represent the node of the graph and their similarity the corre¬ 
sponding edges. 

^ This is a considerable novelty, since SVMs are typically known as classifiers or function approx¬ 
imation models rather than clustering teehniques. 
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Mall & Suykens 2014) in case of community detection in evolving networks and 
(Peluffo et al. 2013) in relation to human motion tracking. 


1.2 Notation 


x^ 

Transpose of the vector x 


Transpose of the matrix A 

In 

N xN Identity matrix 

Itv 

N X 1 Vector of ones 


Training sample of Mr data points 

(pi-) 

Feature map 


Feature space of dimension d/, 

{'^p}p=l 

Partitioning composed of k clusters 


Set of N vertices Y and m 

l-l 

Cardinality of a set 


1.3 Kernel Spectral Clustering (KSC) 


1.3.1 Mathematical formulation 


1.3.1.1 Training problem 


The KSC formulation for k clusters is stated as a combination of k — 1 binary 
problems (Alzate & Suykens 2010). In particular, given a set of training data 
^tr = the primal problem is; 


min 

subject to 


^ /=i ^ i=\ 


( 1 . 1 ) 


The = \e\\... ^e'P are the projections of the training data mapped 

in the feature space along the direction . For a given point x,, the model in the 
primal form is: 

e«=w«>(x,)+fe/. (1.2) 

The primal problem ([U) expresses the maximization of the weighted variances 
of the data given by and the contextual minimization of the squared 

norm of the vector VI. The regularization constants Yi € mediate the 
model complexity expressed by with the correct representation of the train¬ 
ing data. V G the weighting matrix and <l> is the Aq. x d/, feature matrix 
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<1> = [(p{xiY;...;(p{xN^^Y], where ^ —>■ denotes the mapping to a high¬ 

dimensional feature space, b/ are bias terms. 

The dual problem corresponding to the primal formulation ([U), by setting V = 
D ^ become^ 

=A.ia''‘'> (1.3) 

where Q is the kernel matrix with ij-th entry Qjj — K{xi,Xj) = <p{xiY(pixf). 
fC : X > K means the kernel function. The type of kernel function to utilize 

is application-dependent, as it is outlined in Table o The matrix D is the graph 
degree matrix which is diagonal with positive elements Du = Mo is a cen¬ 

tering matrix defined as Mo = Inh — q-ij —*’ *^he are vectors of 

dual variables, Xi = x —>■ K is the kernel function. The dual clustering 

model for the /-th point can be expressed as follows: 

= £ afK{xj,Xi) +bi,j=\,...,N,,,l = l,...,k-\. (1.4) 

f=i 

The cluster prototypes can be obtained by binarizing the projections eP as sign(e-^^). 
This step is straightforward because, thanks to presence of the bias term bi, both the 
and the variables get automatically centred around zero. The set of the most 
frequent binary indicators form a code-book where each code-word 

of length k—\ represents a cluster. 


Application 

Kernel Name 

Mathematieal Expression 

Vector data 

RBF 

K{Xi,Xj) =exp(-||x;-Xj||j/cT^) 

Images 

RBF;j2 

vT 

K(/t(‘),/tO)) =exp{-^) 

Text 

Cosine 

Kixi.Xj) = II 7,^-' II 

Time-series 

RBFcd 

K{xi,Xj) = exp(-||xi 


Table 1.1: Types of kernel functions for different applications. In this Table RBF 
means Radial Basis Function, a denotes the bandwidth of the kernel. The symbol 
indicates a color histogram representing the i—th pixel of an image, and to com¬ 
pare two histograms and the statistical test is used (Puzicha et al. 1997). 
Regarding time-series data, the symbol cd means correlation distance (Liao 2005), 

and ||x; — Xy llcd = —Rij), where Rtj can indicate the Pearson or Spearman’s 

rank correlation coefficient between time-series Xi and Xj. 


Interestingly, problem ( |1.3| l has a close connection with SC based on a random 
walk Laplacian. In this respect, the kernel matrix can be considered as a weighted 


^ By choosing V = I, problem ' 1.3 
et al. 1998, Mika et al. 1999). 


is identical to kernel PCA (Suykens et al. 2003, Scholkopf 
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graph = {y ^ S') with the nodes v; G y represented by the data points Xi. This 
graph has a corresponding random walk in which the probability of leaving a ver¬ 
tex is distributed among the outgoing edges according to their weight; = Ppt, 
where P = D ^Q. indicates the transition matrix with the ij-th entry denoting the 
probability of moving from node i to node j in one time-step. Moreover, the sta¬ 
tionary distribution of the Markov Chain describes the scenario where the random 
walker stays mostly in the same cluster and seldom moves to the other clusters 
(Meila & Shi 2001^, Meila & Shi 2001b, Meila & Shi 2001a, Delvenne et al. 2010). 


1.3.1.2 Generalization 

Given the dual model parameters and bi, it is possible to assign a membership 
to unseen points by calculating their projections onto the eigenvectors computed in 
the training phase: 

= (1.5) 

where f2test is the A^est x N kernel matrix evaluated using the test points with en¬ 
tries f2test,ri = r = 1,...,Attest, i = 1) ■ ■ ■ )Mr- The cluster indicator for a 

given test point can be obtained by using an Error Correcting Output Codes (ECOC) 
decoding procedure: 

• the score variable is binarized 

• the indicator is compared with the training code-book 'SSS (see previous Sec¬ 
tion), and the point is assigned to the nearest prototype in terms of Hamming 
distance. 

The KSC method, comprising training and test stage, is summarized in algorithm 
and the related Matlab package is freely available on the Wel:|^ 


1.3.1.3 Model selection 

In order to select tuning parameters like the number of clusters k and eventually 
the kernel parameters, a model selection procedure based on grid search is adopted. 
Eirst, a validation set ^vai = is sampled from the whole dataset. Then, a grid 

of possible values of the tuning parameters is constructed. Afterwards, a KSC model 
is trained for each combination of parameters and the chosen criterion is evaluated 
on the partitioning predicted for the validation data. Einally, the parameters yielding 
the maximum value of the criterion are selected. Depending on the kind of data, a 
variety of model selection criteria have been proposed; 

• Balanced Line Fit (BLF). It indicates the amount of collinearity between vali¬ 
dation points belonging to the same cluster, in the space of the projections. It 


http://www.esat.kuleuven.be/stadius/ADB/alzate/softwareKSClab.php 
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Algorithm 1: KSC algorithm (Alzate & Suykens 2010) 

Data: Training set ^tr = {xi}f^i, test set latest = kernel function 

K : X —> R positive definite and localized (K{xi,Xj) —> 0 if Xj and xj belong to 

different clusters), kernel parameters (if any), number of clusters k. 

Result: Clusters {M,■ • • iMr}. codebook V3S = with {cp} 6 { — 1,1}*^^*. 

1 compute the training eigenvectors I = I,... ,k — 1, corresponding to the k — 1 largest 
eigenvalues of problem 

2 let A 6 0 be the matrix containing the vectors ^ q;(<: ') as columns 

3 binarize A and let the code-book ‘^3S = be composed by the k encodings of 

Q = sign(A) with the most occurrences 

4 Vi, ! = 1,... ,Wtr. assign x; to Ap* where p* = argminp(fH(sign{a;),Cp) and .) is the 
Hamming distance 

5 binarize the test data projections sign(eM^), m = 1,... ,A)est, and let sign(e,„) 6 
be the encoding vector of 

6 Vm, assignxjjj"' to Ap», where p* = argminprf;/(sign(em),Cp). 


reaches its maximum value 1 in case of well separated clusters, represen ted a s 


l.li 


lines in the space of the (see for instance the bottom left side of Figure 
Balanced Angular Fit or BAF (Mall et al. 2013/?). For every cluster, the sum 
of the cosine similarity between the validation points and the cluster prototype, 
divided by the cardinality of that cluster, is computed. These similarity values are 
then summed up and divided by the total number of clusters. 

Average Membership Strength abbr AMS (Langone, Mall & Suykens 2013). The 
mean membership per cluster denoting the mean degree of belonging of the val¬ 
idation points to the cluster is computed. These mean cluster memberships are 
then averaged over the number of clusters. 

Modularity (Newman 2006). This quality function is well suited for network 
data. In the model selection scheme, the Modularity of the validation sub-graph 
corresponding to a given partitioning is computed, and the parameters related to 
the highest Modularity are selected (Langone et al. 2011, Langone et al. 2012). 
Fisher Criterion. The classical Fisher criterion (Bishop 2006) used in classifica¬ 
tion has been adapted to select the number of clusters k and the kernel parameters 
in the KSC framework (Alzate & Suykens 2012). The criterion maximizes the 
distance between the means of the two clusters while minimizing the variance 


within each cluster, in the space of the projections e 


{!) 

val' 


In Figure |L1| an example of clustering obtained by KSC on a synthetic dataset 
is shown. The BLF model selection criterion has been used to tune the bandwidth 
of the RBF kernel and the number of clusters. It can be noticed how the results are 
quite accurate, despite the fact that the clustering boundaries are highly nonlinear. 
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Fig. 1.1: KSC partitioning on a toy dataset. (Top) Original dataset consisting of 
3 clusters (left) and obtained clustering results (right). (Bottom) Points represented 
in the space of the projections (left), for an optimal choice of k (and CJ^ = 

4.36 • 10^^) suggested by the BLF criterion (right). We can notice how the points 
belonging to one cluster tend to lie on the same line. A perfect line structure is not 
attained due to a certain amount of overlap between the clusters. 


1.3.2 Soft Kernel Spectral Clustering (SKSC) 

Soft kernel spectral clustering (SKSC) makes use of algorithm in order to com¬ 
pute a first hard partitioning of the training data. Next, soft cluster assignments are 
performed by computing the cosine distance between each point and some clus¬ 
ter prototypes in the space of the projections In particular, given the projec¬ 
tions for the training points e, = *^], i = 1,... ,Nxi and the correspond¬ 

ing hard assignments q^’ we can calculate for each cluster the cluster prototypes 
Sp G as: 


1 

Sp = —Y^ei ( 1 . 6 ) 

where Up is the number of points assigned to cluster p during the initialization step 
by KSC. Then the cosine distance between the i-th point in the projections space 
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and a prototype Sp is calculated by means of the following formula: 


df°^ = l-eJsp/i\\eM\^p\\2). 

The soft membership of point i to cluster q can be finally expressed as: 


smf'^ = • 




4=1 n 




(1.7) 


( 1 . 8 ) 


with C 


Sp) 


= 1. As pointed-out in (Ben-Israel & lyigun 2008), this member¬ 


ship represents a subjective probability expressing the belief in the clustering as¬ 
signment. 

The out-of-sample extension on unseen data consists simply of calculating eq. 


(1.51 and assigning the test projections to the closest centroid. 


An example of soft clustering performed by SKSC on a synthetic dataset is de¬ 
picted in Figure [L^ The AMS model selection criterion has been used to select the 
bandwidth of the RBF kernel and the optimal number of clusters. The reader can 
appreciate how SKSC provides more interpretable outcomes compared to KSC. 

The SKSC method is summarized in algorithm and a Matlab implementation 
is freely downloadabl^ 


Algorithm 2: SKSC algorithm (Langone, Mall & Suykens 2013) 

Data: Training set ^tr = and test set i^test = kernel function 

K : X R positive definite and localized (K(xi,Xj) —> 0 if Xj and xj belong to 

different clusters), kernel parameters (if any), number of clusters k. 

Result: Clusters {^i,s^p ,..., £4}> soft cluster memberships sm^P) ,p= I,... ,k, cluster 
prototypes ^ Sp 6 R*^“'. 

1 Initialization by solving eq. i 1.4 1 . 


2 Compute the new prototypes si,.. .,st- (eq. ' 1.6 ). 


3 Calculate the test data projections elS , m = 1,..., Mest, I = \ ,...,k— 1. _ 

4 Find the cosine distance between each projection and all the prototypes (eq. fTTjljVm, 
assign to cluster Ap with membership sm^P) according to eq. 1 1.8 i. 


1.3.3 Hierarchical Clustering 

In many cases, clusters are formed by sub-clusters which in turn might have sub¬ 
structures. As a consequence, an algorithm able to discover a hierarchical organiza¬ 
tion of the clusters provides a more informative result, incorporating several scales 


5 


http://www. esat. kuleuven. be/stadius/ADB/langone/softwareSKSClab.php 
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Fig. 1.2: SKSC partitioning on a synthetic dataset. (Top) Original dataset consist¬ 
ing of 2 clusters (left) and obtained soft clustering results (right). (Bottom) Points 
represented in the space of the projection (left), for an optimal choice of k (and 
(J^ = 1.53 • 10^^) as detected by the AMS criterion (right). 


in the analysis. The flat KSC algorithm has been extended in two ways in order to 
deal with hierarchical clustering. 


1.3.3.1 Approach 1 


This approach, named hierarchical kernel spectral clustering (HKSC), was proposed 
in (Alzate & Suykens 2012) and exploits the information of a multi-scale structure 
present in the data given by the Fisher criterion (see end of Section 1.3.1.3 1 . A grid 
search over different values of k and is performed to find tuning parameter pairs 
such that the criterion is greater than a specified threshold value. The KSC model 
is then trained for each pair and evaluated at the test set using the out-of-sample 
extension. A specialized linkage criterion determines which clusters are merging 

















10 


1 Kernel Specti'al Clustering and applications 


based on the evolution of the cluster memberships as the hierarchy goes up. The 
whole procedure is summarized in algorithm]^ 


Algorithm 3: HKSC algorithm (Alzate & Suykens 2012) 

Data: Training set ^tr = Validation set ®val = and test set 

latest = RBF kernel function with parameter maximum number of 

clusters set of R values {a^,... ,a^}, Fisher threshold 6. 

Result: Linkage matrix Z 

1 For every combination of parameter pairs {k, O") train a KSC model using algorithm|^ 
predict the cluster memberships for validation points and calculate the related Fisher 
criterion 

2 Vk, find the maximum value of the Fisher criterion across the given range of values. If the 
maximum value is greater than the Fisher threshold 9, create a set of these optimal (kt,al) 
pairs. 

3 Using the previously found (k^^al) pairs train a clustering model and compute the cluster 
memberships for the test set using the out-of-sample extension. 

4 Create the linkage matrix Z by identifying which clusters merge starting from the bottom of 
the tree which contains max k, clusters. 


1.3.3.2 Approach 2 

In (Mall, Langone & Suykens 2014/?) and (Mall, Langone & Suykens 2014a) an 
alternative hierarchical extension of the basic KSC algorithm was introduced, for 
network and vector data respectively. In this method, called agglomerative hierar¬ 
chical kernel spectral clustering (AH-KSC), the structure of the projections in the 
eigenspace is used to automatically determine a set of increasing distance thresh¬ 
olds. At the beginning, the validation point with maximum number of similar points 
within the first threshold value is selected. The indices of all these points represent 
the first cluster at level 0 of hierarchy. These points are then removed from the val¬ 
idation data matrix, and the process is repeated iteratively until the matrix becomes 
empty. Thus, the first level of hierarchy corresponding to the first distance threshold 
is obtained. To obtain the clusters at the next level of hierarchy the clusters at the 
previous levels are treated as data points, and the whole procedure is repeated again 
with other threshold values. This step takes inspiration from (Blondel et al. 2008). 
The algorithm stops when only one cluster remains. The same procedure is applied 
in the test stage, where the distance thresholds computed in the validation phase are 
used. An overview of all the steps involved in the algorithm is depicted in Figure 
0 In Figure [1~4| an example of hierarchical clustering performed by this algorithm 
on a toy dataset is shown. 
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Fig. 1.3; AH-KSC algorithm. Steps of AH-KSC method as described in (Mall, 
Langone & Suykens 2014h) with addition of the step where the optimal a and k are 
estimated. 
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Fig. 1.4: AH-KSC partitioning on a toy dataset. Cluster memberships for a toy 
dataset at different hierarchical levels obtained by the AH-KSC method. 


1.3.4 Sparse Clustering Models 

The computational complexity of the KSC algorithm depends on solving the eigen¬ 
value problem ( |1.3| ) related to the training stage and computing eq. ( |1.5| l which 
gives the cluster memberships of the remaining points. Assuming that we have Atot 
data and we use Atr points for training and Atest = Nm — Nn as test set, the runtime 
of algorithmis (9(Atj.) + (9(An-Atest)- In order to reduce the computational com¬ 
plexity, it is then necessary to hnd a reduced set of training points, without loosing 
accuracy. In the next Sections two different methods to obtain a sparse KSC model, 
based on the Incomplete Cholesky Decomposition (ICD) and L\ and Lq penalties 
respectively, are discussed. In particular, thanks to the ICD, the KSC computational 
complexity for the training problem is decreased to 6>(/?^Atr) (Novak et al. 2015), 
where R indicates the reduced set size. 
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1.3.4.1 Incomplete Cholesky Decomposition 


One of the KKT optimality conditions characterizing the Lagrangian of problem 
( |1.1| ) is: 


w 


Mr 


(0 _ ^ aP(p{xi). 


(1.9) 


1=1 


From eq. ( |1.9| l it is evident that each training data point contributes to the primal 
variable resulting in a non-sparse model. In order to obtain a parsimonious 
model a reduced set method based on the Incomplete Cholesky Decomposition 
(ICD) was proposed in (Alzate & Suykens 2011, Novak et al. 2015). The tech¬ 
nique is based on finding a small number R ^ Ntr of points ^ and related 

coefficients with the aim of approximating as: 


r=l 


( 1 . 10 ) 


As a consequence, the projection of an arbitrary data point x into the training em¬ 
bedding is given by: 




= Y, Cr''’K{x,Xr)+bi. 


( 1 . 11 ) 


The set of points can be obtained by considering the pivots of the ICD performed 
on the kernel matrix Q. In particular, by assuming that Q has a small numerical 
rank, the kernel matrix can be approximated by = GG^, with G G 

If we plug in this approximated kernel matrix in problem o, the KSC eigenvalue 
problem can be written as: 

= ( 1 . 12 ) 


where U G and V G denotes the left and right singular vectors deriving 

from the singular value decomposition (SVD) of G, and T' G RN rxMr j g fjjg matrix 
of the singular values. If now we pre-multiply both sides of eq. (1.12i by and 
replace 5^9 = a (9, only the following eigenvalue problem of size R xR must be 
solved: 

= Xi5‘'‘\l = l,...,k. (1.13) 


The approximated eigenvectors of the original problem ( |1.3| l can be computed as 
o:(0 =u§{i)^ and the sparse parameter vector can be found by solving the following 
optimization problem: 


min^^i) II ||^=m/n^(;) || <p'^ II 2 • (1-14) 


The corresponding dual problem can be written as follows: 




(1.15) 
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where = K{xr,Xs), = K{xr,Xi), r,s = = l,...,Ntr and I — 

Since the size R of problem ( |1.13| l can be much smaller than the size A^tr 
of the starting problem, the sparse KSC methoc0is suitable for big data analytics. 


1.3.4.2 Using Additional Penalty terms 


In this part we explore sparsity in the KSC technique by using an additional penalty 
term in the objective function ( |1.14| i. In (Alzate & Suykens 2011), the authors used 
an L\ penalization term in combination with the reconstruction error term to in¬ 
troduce sparsity. It is well known that the Li regularization introduces sparsity as 
shown in (Zhu et al. 2003). However, the resulting reduced set is neither the spars¬ 
est nor the most optimal w.r.t. the quality of clustering for the entire dataset. In 
(Mall, Mehrkanoon, Langone & Suykens 2014), we introduced alternative penal¬ 
ization techniques like Group Lasso (Yuan & Lin 2006) and (Friedman et al. 2010), 
L() and Li + Lq penalizations. The Group Lasso penalty is ideal for clusters as it 
results in groups of relevant data points. The Lq regularization calculates the num¬ 
ber of non-zero terms in the vector. The Lp-norm results in a non-convex and NP- 
hard optimization problem. We modify the convex relaxation of Lp-norm based on 
an iterative re-weighted Li formulation introduced in (Candes et al. 2008, Huang 
et al. 2010). We apply it to obtain the optimal reduced sets for sparse kernel spectral 
clustering. Below we provide the formulation for Group Lasso penalized objective 
( L16|l and re-weighted Li-norm penalized objectives ( L17|l. 


The Group Lasso (Yuan & Lin 2006) based formulation for our optimization 
problem is: 


min ||0To;_0T^||2_^;t^ V^||^/|l 2 , (L16) 

where (P = [(j){xi),... ,(1 ){xn„)], cc = ...,a S KYrx(A:-i) p _ 

[/3i ,...p G RYrx(i:-i) ^ Q,(i) g ^N,r p. g and we set ^/pj 

as the fraction of training points belonging to the cluster to which the training 
point belongs. By varying the value of A we control the amount of sparsity intro¬ 
duced in the model as it acts as a regularization parameter. In (Friedman et al. 2010), 
the authors show that if the initial solutions are ■ ■ jl^Ntr '^^en if ||Yj^(y — 

Y,j3,)|| < A, then j3/ is zero otherwise it satisfies: = (X^Xi -f A/||j3/||)^^Y^^r/ 
where n =y-'Li^iXiPi. 

Analogous to this, the solution to the group lasso penalization for our problem 
can be defined as: ||p (x/) (^>Ta — 0 {xi)Pi)\\ < A then pi is zero otherwise it sat¬ 
isfies: Pi = {0'^<P + X/\\Pi\\)^^(j){xi)ri where r/ = — '^i^i(j)(xi)Pi. The Group 

Lasso penalization technique can be solved by a blockwise co-ordinate descent pro- 

® A C implementation of the algorithm can be downloaded at: 
http://www.esat.kuleuven.be/stadius/ADB/novak/softwareKSCICD.php 
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cedure as shown in (Yuan & Lin 2006). The time complexity of the approach is 
C>(maxiter * where maxiter is the maximum number of iterations specified 

for the co-ordinate descent procedure and k is the number of clusters obtained via 
KSC. From our experiments we observed that on an average 10 iterations suffice for 
convergence. 

Concerning the re-weighted Li procedure, we modify the algorithm related to 
classification as shown in (Huang et al. 2010) and use it for obtaining the reduced 
set in our clustering setting; 

min 

j3GR"(rX{*-l) 

such that 

where A is matrix of the same size as the matrix i.e. A € -pj^g (•gj.jjj 

11A 11 2 along with the constraint | j 11 2 < £; corresponds to the Lp-norm penalty on j3 
matrix. A matrix is initially defined as a matrix of ones so that it gives equal chance 
to each element of j3 matrix to reduce to zero. The constraints on the optimization 
problem forces each element of j3, G to reduce to zero. This helps to overcome 

the problem of sparsity per component which is explained in (Alzate & Suykens 
2011). The p variable is a regularizer which controls the amount of sparsity that is 
introduced by solving this optimization problem. 

In Figure fTTSj an example of clustering obtained using the group lasso formulation 
( |1.16| l on a toy dataset is depicted. We can notice how the sparse KSC model is able 
to obtain high quality generalization using only 4 points in the training set. 


\\0^a-0^P\\l + p^et + \\AI5\\l 

i=\ 

\\lii\\ 2 <ei,i=l,---,N,r 

e, > 0, 


1.4 Applications 

The KSC algorithm has been successfully used in a variety of applications in dif¬ 
ferent domains. In the next Sections we will illustrate various results obtained in 
different fields such as computer vision, information retrieval and power load con¬ 
sumer segmentation. 


1.4.1 Image Segmentation 

Image segmentation relates to partitioning a digital image into multiple regions, 
such that pixels in the same group share a certain visual content. In the experiments 
performed using KSC only the color information is exploited in order to segment the 
given image^ More precisely, a local color histogram with a 5 x 5 pixels window 

^ The images have been extracted from the Berkeley image database (Martin et al. 2001). 
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Fig. 1.5; Sparse KSC on toy dataset. (Top) Gaussian mixture with three highly 
overlapping components (Center) Clustering results, where the reduced set points 
are indicated with red circles (Bottom) Generalization boundaries. 


around each pixel is computed using minimum variance color quantization of 8 
levels. Then, in order to compare the similarity between two histograms h^‘') and h^-'\ 

xf' 

the positive definite kernel — exp(-y) has been adopted (Fowlkes 


et al. 2004). The symbol Xij denotes the Xtj statistical test used to compare two 
probability distributions (Puzicha et al. 1997), Oy, as usual indicates the bandwidth 
of the kernel. In Figure 1.6 an example of segmentation obtained using the basic 
KSC algorithm is given. 
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so 100 150 200 50 100 ISO 200 


Fig. 1.6; Image segmentation. (Left) Original image (Right) Segmentation given 
by KSC. 


1.4.2 Scientific Journal Clustering 

We present here an integrated approach for clustering scientific journals using KSC. 
Textual information is combined with cross-citation information in order to obtain 
a coherent grouping of the scientific journals and to improve over existing journal 
categorizations. The number of clusters k in this scenario is fixed to 22 since we 
want to compare the results with respect to the 22 essential science indicators (ESI) 
shown in Table 11.21 


Field 

Name 


Field 

Name 

1 

Agricultural sciences 


12 

Mathematics 

2 

Biology and biochemistry 


13 

Microbiology 

3 

Chemistry 


14 

Molecular biology & genetics 

4 

Clinieal medicine 


15 

Multidisciplinary 

5 

Computer seience 


16 

Neuroscience & behavior 

6 

Economics and business 


17 

Pharmacology & toxicology 

7 

Engineering 


18 

Physics 

8 

Environment/Ecology 


19 

Plant & animal science 

9 

Geosciences 


20 

Psychology / Psychiatry 

10 

Immunology 


21 

Social sciences 

11 

Materials sciences 


22 

Space science 


Table 1.2; The 22 science fields according to the essential science indicators (ESI) 


The data correspond to more than six million scientific papers indexed by the 
Web of Science (WoS) in the period 2002 — 2006. The type of manuscripts consid¬ 
ered is article, letter, note and review. Textual information has been extracted from 
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titles, abstracts and keywords of each paper together with citation information. From 
these data, the resulting number of journals under consideration is 8,305. 

The two resulting datasets contain textual and cross-citation information and are 
described as follows: 

• Term/Concept by Journal dataset: The textual information was processed us¬ 
ing the term frequency - inverse document frequency (TF-IDF) weighting pro¬ 
cedure (Baeza-Yates & Ribeiro-Neto 1999). Terms which occur only in one 
document and stop words were not considered into the analysis. The Porter 
stemmer was applied to the remaining terms in the abstract, title and key¬ 
word fields. This processing leads to a term-by-document matrix of around 
six million papers and 669,860 term dimensionality. The final journal-by-term 
dataset is a 8,305 x 669,860 matrix. Additionally, latent semantic indexing (LSI) 
(Deerwester et al. 1990) was performed on this dataset to reduce the term dimen¬ 
sionality to 200 factors. 

• Journal cross-citation dataset: A different form of analyzing cluster informa¬ 
tion at the journal level is through a cross-citation graph. This graph contains 
aggregated citations between papers forming a journal-by-journal cross-citation 
matrix. The direction of the citations is not taken into account which leads to an 
undirected graph and a symmetric cross-citation matrix. 

The cross-citation and the text/concept datasets are integrated at the kernel level by 
considering the following linear combination of kernel matrice^ 

^integr ^ p^cross-cit (• j _ p)^text 


where 0 < p < 1 is a user-defined integration weight which value can be obtained 
from internal validation measures for cluster distortior0 ^cross-cit cj-qss- 

citation kernel matrix with /y'-th entry 


_ i^/_cross-cit „cross-cit 

— ,X 


), is 


the /-th journal represented in terms of cross-citation variables, is the textual 
kernel matrix with ij-th entry is the /-th journal repre¬ 

sented in terms of textual variables and i,j= 

The KSC outcomes are depicted in Tables |1.3| and |1.4| In particular. Table 


1.3 shows the results in terms of internal validation of cluster quality, namely 


mean silhouette value (MSV) (Rousseeuw 1987) and Modularity (Newman & 
Girvan 2004, Newman 2006), and in terms of agreement with existing categoriza¬ 
tions (adjusted rand index or ARI (Hubert & Arabie 1985) and normalized mutual 
information (NMI (Strehl & Ghosh 2002)). Finally, Table 1.4 shows the top 20 terms 
per cluster, which indicate a coherent structure and illustrate that KSC is able to de¬ 
tect the text categories present in the corpus. 


* Here we use the cosine kernel described in Table |l.l] 

® In our experiments we used the mean silhouette value (MSV) as an internal cluster validation 
criterion to select the value of p which gives more coherent clusters. 
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Internal validation 

External validation 

MSV 

textual 

MSV 

cross-cit. 

MSV 

integrated 

Modularity 

cross-cit. 

Modularity 
ISI 254 

ARl 
22 ESI 

NMI 

22 ESI 

22 ESI fields 

0.057 

0.016 

0.063 

0.475 

0.526* 

1.000 

1.000 

Cross-citations 

0.093 

0.057 

0.189 

0.547 

0.442 

0.278 

0.516 

Textual (LSI) 

0.118 

0.035 

0.130 

0.505 

0.451 

0.273 

0.516 

Hierarch. Ward’s method p = 0.5 

0.121 

0.055 

0.190 

0.547 

0.488 

0.285 

0.540 

Integr. Terms+Cross-citations p = 0.5 

0.138 

0.064 

0.201 

0.533 

0.465 

0.294 

0.557 

Integr. LSI+Cross-citations p = 0.5 

0.145 

0.062 

0.197 

0.527 

0.465 

0.308 

0.560 


Table 1.3: Text clustering quality. Spectral clustering results of several integration 
methods in terms of mean Silhouette value (MSV), modularity, adjusted Rand index 
(ARl) and normalized mutual information (NMI). The first four rows correspond to 
existing clustering results used for comparison. The last two rows correspond to 
the proposed spectral clustering algorithms. For external validation, the clustering 
results are compared with respect to the 22 ESI fields and the ISI 254 subject cate¬ 
gories. The highest value per column is indicated in bold while the second highest 
value appears in italic. For MSV, a standard t-test for the difference in means re¬ 
vealed that differences between highest and second highest values are statistically 
significant at the 1% significance level (p-value < 10^). The selected method for 
further comparisons is the integrated LSI-tCross-citations approach since it wins 
in external validation with one highest value (NMI) and one second highest value 
(Modularity). 


1.4.3 Power Load Clustering 


Accurate power load forecasts are essential in electrical grids and markets partic¬ 
ularly for planning and control operations (Alzate et al. 2009). In this scenario, 
we apply KSC for finding power load smart meter data that are similar in order 
to aggregate them and improve the forecasting accuracy of the global consump¬ 
tion signal. The idea is to fit a forecasting model on the aggregated load of each 
cluster (aggregator). The k predictions are summed to form the final disaggregated 
prediction. The number of clusters and the time series used for each aggregator are 
determined via KSC (Alzate & Sinn 2013). The forecasting model used is a peri¬ 
odic autoregresive model with exogenous variables (PARK) (Espinoza et al. 2005). 


Table 1.7 (taken from (Alzate & Sinn 2013) shows the model selection and disag¬ 
gregation results. Several kernels appropriate for time series were tried including a 
Vector Autoregressive (VAR) kernel [Add: Cuturi, Autoregressive kernels for time 
series, arXiv], Triangular Global Alignment (TGA) kernel [Add: Cuturi, Fast Global 
Alignment Kernels, ICML 2011] and an RBF kernel with Spearman’s distance. The 
results show an improvement of 20.55% with the similarity based on Spearman’s 
corrleation in the forecasting accuracy compared to not using clustering at all (i.e., 
aggregating all smart meters). The BEE was also able to detect the number of clus¬ 
ters that maximize the improvement (6 clusters in this case). 
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Kernel 

Model Selection 

Disaggregated Forecast 
Baseline MAPE = 3.2(5% 

k* 

BLF(k*) 

k* 

MAPE(k*) 

Gain 

VAR 

7 

0.49 

13 

2.85% 

12.68% 

TGA 

5 

0.f)3 

8 

2.61% 

20.04% 

Spearman 

6 

0.59 

6 

2.59% 

20.55% 

RBF-DB6-II 

4 

0.53 

5 

3.02% 

7.36% 

kmeans-DB6-l 1 



16 

2.93% 


Random 



3 

2.93% ± 0.03 



Fig. 1.7; Kernel comparisons for power load clustering data. Model selection and 
forecasting results in terms of the mean absolute percentage error (MAPE). RBF- 
DB6-11 refers to using the RBF kernel on the detail coefficients using wavelets 
(DB6, 11 levels). The winner is the Spearman-based kernel with a improvement of 
20.55%. For this kernel, the number of clusters k found by the BFF also coincides 
with the number of aggregators needed to maximize the improvement. 




/10/10 Mon/01/11/10 Tue/02/11/10 Wed/03/11/10 Thu/04/11/10 Fri/05/11/10 Sat/06/11/10 Sun/07/11/10 Mon/08/11/10 Tue/09/11/10 


Fig. 1.8; Power load clustering results. Visualization of the 6 clusters obtained by 
KSC. (Top) Aggregated load in summer. (Bottom) Aggregated load in winter. The 
daily cycles are clearly visible and the clusters capture different characteristics of 
the consumption pattern. This clustering result improves the forecasting accuracy 
by 20.55% 

























20 


1 Kernel Specti'al Clustering and applications 


Best 20 terms 


diabet therapi hospit arteri coronari physi¬ 
cian renal hypertens mortal syndrom car¬ 
diac nurs chronic infect pain cardiovascular 
symptom serum cancer pulmonari 


Cluster 1 


polit war court reform parti legal gender ur¬ 
ban democraci democrat civil capit feder dis¬ 
cours economi justic privat liber union welfar 


Cluster 2 


diet milk fat intak cow dietari fed meat nu- 
Cluster 3 trit fatti chees vitamin ferment fish dry fruit 
antioxid breed pig egg 


Cluster 4 


Cluster 5 


alloi steel crack coat corros fiber concret mi- 
crostructur thermal weld film deform ceram 
fatigu shear powder specimen grain fractur 
glass 

infect hiv vaccin viru immun dog antibodi 
antigen pathogen il per parasit viral bacteri 
dna therapi mice bacteria cat assai 


psycholog cognit mental adolesc emot symp¬ 
tom child anxieti student sexual interview 
school abus psychiatr gender attitud mother 
alcohol item disabl 

text music polit literari philosophi narr 
english moral book essai write discours 
philosoph fiction ethic poetri linguist german 
Christian religi 

firm price busi trade economi invest capit tax 
wage financi compani incom custom sector 
bank organiz corpor stock employ strateg 
nonlinear finit asymptot veloc motion 
stochast elast nois turbul ltd vibrat iter crack 
vehicl infin singular shear polynomi mesh 
fuzzi 


Cluster 6 


Cluster 7 


Cluster 8 


Cluster 9 


soil seed forest crop leaf cultivar seedl ha 
Cluster 10 shoot fruit wheat fertil veget germin rice 
flower season irrig dry weed 


Cluster 11 


soil sediment river sea climat land lake pol- 
lut wast fuel wind ocean atmospher ic emiss 
reactor season forest urban basin 


Best 20 terms 

algebra theorem manifold let finit infin poly- 
Cluster 12 invari omega singular inequ compact 

lambda graph conjectur convex proof asymptot 
bar phi 

pain surgeri injuri lesion muscl bone brain ey 
Cluster 13 surgic nerv mri ct syndrom fractur motor im¬ 
plant arteri knee spinal stroke 
rock basin fault sediment miner ma tecton 
Cluster 14 mantl volcan metamorph seismic sea 

magma faci earthquak ocean cretac crust sed- 
imentari 

web graph fuzzi logic queri schedul semant 

, _ robot machin video wireless neural node inter- 
Cluster 15 . /• , , 

net traffic processor retnev execut fault packet 

student school teacher teach classroom instruct 
Cluster 16 academ curriculum literaci learner colleg 
write profession disabl faculti english cognit 
peer gender 

habitat genu fish sp forest predat egg nest larva 
Cluster 17 reproduct taxa bird season prei nov ecolog is¬ 
land breed mate genera 

star galaxi solar quantum neutrino orbit quark 
, o gravit cosmolog decai nucleon emiss radio nu- 
clei relativist neutron cosmic gaug telescop 
hole 

film laser crystal quantum atom ion beam si nm 
Cluster 19 ^°P^ thermal spin silicon glass scatter dielectr 
voltag excit diffract spectra 
polym catalyst ion bond crystal solvent lig- 
Cluster 20 hydrogen nmi' molecul atom polymer poll 
aqueou adsorpt methyl film spectroscopi elec- 
trod bi 

receptor rat dna neuron mice enzym genom 
Cluster 21 brain mutat peptid kinas inhibitor 

metabol cancer mrna muscl ca2 vitro chromo- 
som 

cancer tumor carcinoma breast therapi pro- 
Cluster 22 chemotherapi tumour surgeri lesion 

lymphoma pancreat recurr resect surgic liver 
lung gastric node 


Table 1.4; Text clustering results. Best 20 terms per cluster according to the inte¬ 
grated results (LSI-tcross-citation) with p — 0.5. The terms found display a coherent 
structure in the clusters. 


1.4.4 Big data 

KSC has been shown to be effective in handling big data at a desktop PC scale. In 
particular, in (Mall et al. 2013h), we focused on community detection in big net¬ 
works containing millions of nodes and several million edges, and we explained 
how to scale our method by means of three step^^ First, we select a smaller sub¬ 
graph that preserves the overall community structure by using the FURS algorithm 


*** A Matlab implementation of the algorithm can he downloaded at: 
http://www.esat.kuleuven.be/stadius/ADB/mall/softwareKSCnet.php 
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(Mall et al. 2013a), where hubs in dense regions of the original graph are selected 
via a greedy activation-deactivation procedure. In this way the kernel matrix related 
to subgraph fits the main memory and the KSC model can be quickly trained by 
solving a smaller eigenvalue problem. Then the BAF criterion described in Section 
1.3.1.31 which is memory and computationally efficient, is used for model selec 


tiorj^^l Finally, the out-of-sample extension is used to infer the cluster memberships 
for the remaining nodes forming the test set (which is divided into chunks due to 
memory constraints). 

In (Mall, Langone & Suykens 20\Ab) the hierarchical clustering technique sum¬ 
marized in Section 1.3.3.2| has been used to perform community detection in real-life 
networks at different resolutions. The method has been shown to be able to detect 
complex structures at various hierarchical levels, by not suffering of any resolution 
limit. An example of results obtained on the Cond-mat network of collaborations be¬ 
tween authors of papers submitted to Condense Matter category in Arxiv (Leskovec 
et al. 2007) is shown in Figure [L^ 

Finally, in (Mall, Jumutc, Langone & Suykens 2014), we propose a deterministic 
method to obtain subsets from big vector data which are a good representative of 
the inherent clustering structure. We first convert the large scale dataset into a sparse 
undirected k-NN graph using a Map-Reduce framework. Then, the FURS method is 
used to select a few representative nodes from this graph, corresponding to certain 
data points in the original dataset. These points are then used to quickly train the 
KSC model, while the generalization property of the method is exploited to compute 
the cluster memberships for the remainder of the dataset. In Figure [1~T0| a summary 
of all these steps is sketched. 


1.5 Conclusions 

In this chapter we have discussed the kernel spectral clustering (KSC) method, 
which is cast in an LS-SVM learning framework. We have explained that, like in 
the classifier case, the clustering model can be trained on a subset of the data with 
optimal tuning parameters, found during the validation stage. The model is then 
able to generalize to unseen test data thanks to its out-of-sample extension property. 
Beyond the core algorithm, some extensions of KSC allowing to produce proba¬ 
bilistic and hierarchical outputs have been illustrated. Furthermore, two different 
approaches to sparsify the model based on the Incomplete Cholesky Decomposition 
(ICD) and Li and Lq penalties have been described. This allows to handle large scale 
data at a desktop scale. Finally, a number of applications in various fields ranging 
from computer vision to text mining have been examined. 


** In (Mall et al. 2013c) this model selection step has been eliminated by proposing a self tuned 
method where the structure of the projections in the eigenspace is exploited to automatically iden¬ 
tify an optimal cluster structure. 
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Fig. 1.9; Large scale community detection. Community structure detected at one 


particular hierarchical level by the AH-KSC method summarized in Section 1.3.3.2 
related to the Cond-Mat collaboration network. 
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Data: map(): -> top k-NN reduce(): k-NN graph 

Fig. 1.10: Big data clustering. (Top) Illustration of the steps involved in cluster¬ 
ing big vector data using KSC. (Bottom) Map-Reduce procedure used to obtain a 
representative training subset by constructing a k-NN graph. 
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